Modality attention fusion model with hybrid multi-head self-attention for video understanding

Video question answering (Video-QA) is a subject undergoing intense study in Artificial Intelligence, which is one of the tasks which can evaluate such AI abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on the task of answering multiple-choice questions regarding a video-subtitle-QA representation by fusion of attention and self-attention between each modality. We use BERT to extract text features, and use Faster R-CNN to ex-tract visual features to provide a useful input representation for our model to answer questions. In addition, we have constructed a Modality Attention Fusion (MAF) framework for the attention fusion matrix from different modalities (video, subtitles, QA), and use a Hybrid Multi-headed Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network and demonstrate the effectiveness and advantages of our method over existing methods through question type and required modality experimental results.


Introduction
In recent years, studies on video question answering (Video-QA) based on both vision and natural language have successfully benefited from deep neural networks [1][2][3][4][5]. This task aims to select the reasoning process of the correct answer from the answer candidates in the video [6]. Machine's under-standing of images and videos is transitional from labeling the image with a few words to learn to generate a complete sentence.
Most existing methods [7][8][9][10][11] attracted a lot of attention and Video-QA has experienced outstanding advancement. Weinzaepfel et al. [12] proposed a spatio-temporal action localization approach, they applied a tracking-by-detection model which scored video with a combination of static and motion CNN features. To capture more detail in videos, Krishna et al. [13] utilized contextual information from past and future events to jointly describe all events. Lu et al. [14] proposed a multi-step semantic attention network, which learning visual relation a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 facts as semantic knowledge to help infer the correct answer. However, Video-QA tasks based on vision and natural language require the visual representation of the video combined with subtitles to infer the correct answer, so the Video-QA task is more difficult than image captioning tasks. The Video-QA task is essentially the fusion of multiple modal data to generate accurate answers to questions related to the video story. Most models for Video-QA, e.g., [15][16][17] usually use multimodal data to calculate the picture features through the deep convolutional neural network, and the question text features through the recurrent neural network, and then map the input picture and the question features to a common representation space. Finally, the common feature map vector is input to an answer classifier to determine the final answer. However, in real life, the questions people ask about pictures are often related to the target entity in the picture. Therefore, in order to further understand the characteristics of the picture, the image information representation space can be constructed by combining the objects in the picture with the understanding of visual in-formation, and inference the stage focuses on the target entity area of the image and the adjacent caption information. In addition, the characteristics of the questions can be used to focus on different target entity instances according to different questions, thereby improving the accuracy of the answers selected by the model.
In this paper, we propose a novel modal attention fusion model with hybrid multi-head selfattention for Video-QA task based on BERT [18]. We experimented with three Video-QA datasets. First, on the TVQA dataset, we showed that our system outperforms their baseline parser. Second, although the MSVD-QA and MSRVTT-QA datasets are much more challenging and open-ended nature, we are still able to achieve 36.8% and 35.26% accuracy, an absolute improvement of 2.67 points and 0.22 points over baselines, respectively. Additionally, we perform an ablation study and analysis to clarify the strengths of our model, demonstrate the effective-ness of our model in Video-QA. Finally, we demonstrate the effectiveness and advantages of our method over existing methods through question type and required modality experimental results.

Video question answering
A recent direction in Video-QA leverage text modality such as subtitle in addition to video modality for video understanding. It has been gathering a rising attention in recent years, with the release of various Video-QA benchmarks such as STAGE (Lei et al. [19]), it has proposed a Spatio-Temporal Video Question Answering task of requiring intelligent systems to simultaneously retrieve visual concepts of relevant moments to answer spatio-temporal video question, such as a dual-LSTM based approach (Jang et al. [20]) with both spatial and temporal attention, generates spatial and temporal attention to localize which regions in the video need to attend, such as a video question answering framework (Kim et al., [21]) that requires to simultaneously retrieve the relevant moments and referenced visual concepts, such as the progressive attention memory network (PAMN) proposed by Kim et al. [22], the method extracted features through progressive attention mechanism that utilizes cues from both question and answer to progressively prune out irrelevant temporal parts in memory, such as Twostream method (Lei et al. [23]) based on a bi-directional LSTM to encode both textual and visual sequences. Different from previous studies, we use BERT in our work to model the information captured in the video clips.

Self-attention for Video-QA
In the past few years, many studies have thus focused on self-attention models, which aim to identify various complex relations to answering questions. For example, Li et al. [24] proposed positional self-attention to simultaneously attend both visual and textual information for improving answer prediction. Kim et al. [25] apply the multi-head attention and self-attention networks to learn the latent concepts in scene frames and captions. Zhang et al. [26] proposed a hierarchical convolutional self-attention encoder-decoder network to efficiently model video contents from videos to answer questions. Jin et al. [27] proposed a multi-interaction network to learn the potential relations between videos and questions. Kim et al. [28] proposed the modality shifting attention network for localizes the temporal moment of interest, and predict the answer using self-attention mechanism on both video and text modality. DFAF framework [29] learn the relationship between multimodalities by adopting the self-attention mechanism.

Preliminaries
Our work is designed on Video-QA tasks with BERT and Fast R-CNN. In this section, we formally describe the BERT and Faster R-CNN models.

BERT
We use BERT (Devlin et al., [18]) as the backbone of our model architecture to represent all text data. BERT is a language representation model that used bi-directional Transformers to pretrain on a large dataset, and then used the model parameters of the pre-trained model to finetune other NLP tasks. In summary, the BERT model further increased the generalization ability of the word vector model, fully describing the character level, word level and sentence level.

Faster R-CNN
We use the Faster R-CNN [30] model to extract visual characteristics. Faster R-CNN is mainly composed of RPN network and VGG16 network. RPN network is used to generate regional candidate frames, and VGG16 network is used to extract feature maps of candidate images. Faster R-CNN detects and recognizes the target in the candidate area based on the candidate frame extracted by RPN. The overall process of Faster R-CNN has 4 units marked with different colors and shown in Fig 1. 1. Feature extraction: Faster R-CNN utilizes the VGG16 network to extract the feature map of the candidate image, which is shared for the subsequent RPN layer and fully connected layer.
2. RPN network: RPN network is used to generate candidate area frames. This layer determines that the anchor point belongs to the foreground or the background, and then uses the bounding box regression to correct the anchor frame to obtain an accurate candidate frame.
3. ROI pooling layer: This layer collects the input feature maps and candidate target regions to extract the feature maps of the target region, and then transmits them to the subsequent fully connected layer.
4. Target classification and regression: Faster R-CNN utilizes the feature map of the target area to calculate the category of the target area, and utilizes the bounding box regression to obtain the final precise position of the detection frame.

Methodology
The structure of our proposed framework aims to select the correct answer in the Video-QA tasks (shown in Fig 2). We process video and subtitle features with two independent BERT layers, and use BERT to combine visual conceptual features and subtitles and questions with each candidate answer for embedding. These input features use modal attention fusion (MAF) and a hybrid multi-head self-attention (HMS) mechanism to obtain the final answer prediction.

Input representation
Text representation. We create 5 hypotheses by concatenating a question representation q 2 R 768 with 5 candidate answer representations fa i g 5 i¼1 2 R 768 . The question and each of answer candidates were concatenated to form 5 hypotheses qa 2 R n qa �768 , where n qa represents the maximum number of tokens per hypotheses. For each hypothesis, MAF-HMS learns to predict its correctness score and to maximize the score of the correct answer.
Similarly, we create subtitle representations as s 2 R n s �768 . Video representation. At first, we extract the sequence of image frames at 3 fps for the video. Then, extract a high-level semantic representation for each image frame. CNN has been recognized as a powerful deep learning model that can capture the visual concept of the image, so our paper uses the Faster R-CNN model to extract visual characteristics v 2 R n v �768 from top-20 object proposals. As visual characteristics are in the text domain, they are embedded in the manner as the subtitle.
We extracted video representations V 2 R n v �768 , text representations S 2 R n s �768 in the subtitle, and QA pairs fQA i g 2 i¼1 2 R n qa �768 from the second-to-last layer of two BERT layers. BERT. Two separate BERT layers are first applied on the subtitle-QA (s and qa) and Video-QA (v and qa) representations. The pre-trained BERT model can be automatically finetuned to achieve the most advanced performance in various NLP tasks. The first token in each input by BERT is [CLS], which is used to obtain the output in the classification task. The [SEP] token is added to indicate the separation between the two inputs. In this article, we consider the tokens of input for two BERT layers as follows: The output of BERT layers consists of a set of text and video features. These text and video features are flattened and expressed as V 2 R n v �768 , S 2 R n s �768 , fQA i g 2 i¼1 2 R n qa �768 .

Modality attention fusion framework
Modality attention fusion framework is designed to fuse video features, subtitle features and QA features into the final MAF features, which are then fed into the hybrid multi-head selfattention layer to obtain the final prediction result.
The QA features QA 1 2 R n qa �768 and video features V 2 R n v �768 are fused together as V-QA features S VÀ QA t 2 R T QA �T V . Similarly, the QA features QA 2 2 R n qa �768 and subtitle features S 2 R n s �768 are fused together as S-QA features S SÀ QA t 2 R T QA �T S . Then, we use max pooling operation to reduce the size of the fused features of the two different modalities. The S-QA features S SÀ QA t as follows: Where fc is a fully-connected layer. The fused subtitle featuresŝ t from different directions are integrated by concatenating as follows: Similarly, we can define the fused video features as follows: We add fused subtitle features and fused video features to get the final fused features, also called MAF featuresû as follows:û

Hybrid multi-head self-attention
More recently, studies on Video-QA task [24][25][26][27][28][29] show that learning both visual and textual forms of multi-head self-attention leads to more accurate predictions. Multi-head self-attention mechanism is to map the query matrix (Q), key matrix (K) and value matrix (V) to multiple different subspaces. The subspaces are calculated without interference with each other, and finally the output is stitched together. In this paper,û Containing visual and subtitle semantic features are used as input to the multi-head self-attention layer.
Where W Q i , W K i , W V i are the linear mapping matrices of the query matrix (Q), key matrix (K), and value matrix (V) in the multi-head attention layer.
However, [31] showed that there is inevitably a limitation called the low-rank bottleneck in the self-attention mechanism. Specifically, increasing the number of heads under the premise of fixed model parameters will lead to a reduction in the size of each head, and a smaller head size will introduce rank constraints on the projection matrices of each head, resulting in a decrease in their expressiveness. Therefore, inspired by [32], we design a hybrid multi-headed self-attentive (HMS) mechanism with the aim of alleviating the low-rank bottleneck issue in multi-headed self-attentive mechanisms for connecting heads to each other to improve the expressive ability of our model.
In the hybrid multi-head self-attention, we add multiple intrinsic inter-vector similarities γ to express complex intrinsic relationships of heads on Eq 11. The hybrid multi-head self-attention model is shown in Fig 3, as illustrated, the model requires only very few additional parameters to allow the model to capture complex multimodal information. θ sÀ att ðQ; K; VÞ is redefined as: Hyb where h is the number of heads, and we introduce a new matrix g i;j 2 R h�h . γ i,j denotes the learnable parameter that can automatically measure the correlation between each head during training.
After obtaining the feature vector through hybrid multi-head self-attention, the probability y of each answer is the correct answer is predicted by the fully-connect and softmax layer:

Dataset
We evaluate our method on three video QA datasets. More details are given below. MSRVTT-QA [6].
It is an open-ended dataset that contains 10K videos and 243K question answer pairs. It consists of five types of questions, including what, how, where, when and who. Compared to the other two datasets, videos in MSRVTT-QA contain more complex scenes. They are also much longer, ranging from 10 to 30 seconds long, equivalent to 300 to 900 frames per video. Splits for train, validation and test are with the proportions are 65%, 5%, and 30%, respectively.

Baselines
We compared our model with several methods:

Two-stream (Lei et al., [23]). Combines information from different modalities with LSTMs and cross-attention.
PAMN (Kim et al., [22]). Utilize progressive attention memory to update the belief for each answer.

Implementation
In all the experiments, the recommended train/validation/test split was strictly followed, we independently repeated each experiment 10 times and reported average results of ACC.
(accuracy). We use BERT-Base uncased model, it has 12 layers and 768 hidden sizes. In our experiments, the maximum number of tokens per sequence is set to 128, the batch size is 64, the learning rate is set to 0.0001, and the epochs are set to 10. Our evaluation is performed on a machine with Intel(R) Xeon(R) Gold 6132 CPU (2.60GHz), 256G RAM and Nvidia GeForce RTX 2080 Ti.

Results
Results on TVQA. From Table 1, we noted that our approach outperforms the best previous method by 1.53/1.74 accuracy points on the test/val. As compared to the STAGE, we noted the performance is substantially improved. Considering that there are 15,253 and 7,623 validation and test questions, this establishes the strength of our task. We can see that, our model outperforms the baseline models by a large margin. In particular, the scores of our model across all the TV shows are more balanced than the scores from other models, which mean our model is more consistent and robust.
We believe the reasons behind the performance boost are the following: 1. Compared with LSTM-based models (Two-stream, PAMN, Multi-task), BERT-based models (STAGE and MAF-HMS) enable capturing longer dependencies between and within different modalities, especially when there is a long subtitle.
2. Our approach can properly integrate input features from different modalities to help answer questions.
3. Hybrid multi-head self-attention can more fully consider the contribution of each modality, and fusion of multi-head results can make the model extract more important features more accurately and improve the performance of our model. Fig 4, we compare our MAF-HMS with Two-stream, PAMN, Multi-task and STAGE on MSVD-QA dataset. Our MAF-HMS achieves the most promising performance in overall accuracy, which demonstrates the superiority of the proposed method on the non-trivial scenarios.

Results on MSVD-QA and MSRVTT-QA. As is shown in
The MSVD-QA and MSRVTT-QA datasets represent highly challenging benchmarks for machine compared to the TVQA, thanks to their open-ended nature. Our MAF-HMS model outperforms existing methods on both datasets, achieving 36.82% and 35.26% accuracy which are 2.67 points and 0.22 points improvement on MSVD-QA and MSRVTT-QA, respectively. This suggests that the model can handle both small and large datasets better than existing methods.

Ablation study
For further analysis, our model shows substantial improvement over the baseline, we evaluated our models with different variants. In these experiments, we choose the same train-and-test setting. From Table 2, we have the following observations. Effect of model. We design a simpler model with GloVe and LSTM for text representation. As is shown in Table 2, we summarize the ablation analysis of MAF-HMS on TVQA dataset in order to measure the validity of the key components of MAF-HMS. Using BERT for contextual word embeddings significantly improves performance comparing to GloVe embedding and LSTM counterpart.
Effect of variants. To measure to effectiveness of all components, we evaluate the following settings: MAF-HMS w/ JMCQ. JMCQ (Video-Text Fusion) is the modal fusion mechanism of Twostream.
The three "w/" variants have little performance improvement compared to MAF-HMS w/o MAF, and our MAF-HMS has increased by 3.96 points. Thus, MAF component contribution is huge. Finally, we compare our method with two simple variants that drop the HMS and replace the HMS with a standard multi-headed self-attentive unit to show the contribution of our hybrid multi-head self-attention unit. Without HMS, we find out that the performance of our MAF-HMS model decreases by 1.08 points on TVQA dataset. We adopt the multi-headed self-attentive to replace the HMS, the result demonstrates the positive role of HMS.

Qualitative results
In this section, we provide analysis the qualitative examples of TVQA benchmark solved by MAF-HMS. We selected a few successful and unsuccessful prediction samples from TVQA dataset. Following, we list some of these observations. For example, the utterance "How did you come to work today" is very important to answer the question, information from neighboring utterances, e.g., "By bus" is the key message to answer the question. All this information is contained in a single modality and is easy to capture, whether it is baselines or our model. Such contextual relationships are prevalent throughout the TVQA dataset. In addition, these are difficult to find the answer information from the subtitles. If there is a lack of key information in the subtitles, such as the question, "what did he hand her", if the context dependence is not obvious in the subtitles, we can find the answer from the video features, e.g., "A man's hand holding a hamburger". There are strongly related entities in the visual features, our method will give correct predictions based on these fusion features.
We further investigate the performance of MAF-HMS by comparing with STAGE on TVQA dataset. For example, the question "Where did she go after she confessed to him", there are strongly related entities in the visual features, and the answer can be found from the visual modalities, both MAF-HMS and STAGE obtain accurate prediction for such questions. However, in another question "what did he hand her", there is some disturbing information with confusion. We can find the wrong answer "diaper bag" from the subtitle "well, this is his diaper bag", and find the correct answer "hamburger" in the video representation, there are strongly related entities in the question features. Other answers can be found in subtitles and video representations, and our model can more accurately eliminate interference options and find the answer directly from the subtitles with understanding the semantics of the question better.
Finally, we show two failure cases. For example, in the question "Who tells Anton that he has quite the temper when interrogating him in the interrogation room?" and the answer is the names of 5 characters, character's face which is very challenging to capture using the visual concepts extracted by Faster R-CNN. MAF-HMS fails to predict the correct answer as the visual concept features are insufficient in capturing textual cues in the video. Furthermore, in another question "Who is sitting at the computer when the group is talking", our model is difficult to find the action concept features associated with the question and answer from the subtitles and visual features, so we predict failure. When the answer is not explicit in the video frames or the subtitles, our method gives an incorrect prediction, however our MAF-HMS still achieves outstanding results.
From these examples, we can see that our model is able to solve both visual and language modalities multiple-choice questions.

Performance by question type
In this section, we evaluate the proposed model with different question types on TVQA dataset.
This section describes the analysis of MAF-HMS by comparing the accuracy with respect to question type. The For fair comparison with existing methods, we tried to reproduce the results on Twostream, PAMN, Multi-task and STAGE. Notwithstanding the effect achieved on "who" question is not satisfactory, MAF-HMS obtains an average of 72.19% accuracy for the majority of question types, significantly higher than other baselines. Especially, reached 81.93% accuracy performance on "when" question. The reason is that "when" type sentences contain more interactions and inferences between video representation and text representation, and our model can further integrate patterns with BERT and Faster R-CNN to answer multiple choice questions more effectively. The results confirm that our approach shows additional benefit in mining the modal relations between text and video representation, which implies the superiority of MAF-HMS to help infer the correct answer.

Performance by required modality
To investigate whether MAF-HMS is sensitive to the modality entities, we provide the analysis of MAF-HMS by the required modality for all datasets. For comparison purpose, we also designed three types of labels according to which modality is required for answering prediction.
V. There are only strongly related entities in the visual features.
S. There are only strongly related entities in the subtitle features.

V&S.
There are strongly related entities in the visual features and subtitle features.
As is shown in Fig 6, we have manually labeled 3000 (1000 per each type) examples for each dataset.

PLOS ONE
It can be seen that require subtitle and video modalities for answer prediction (i.e., label V&S) achieves the highest accuracy score among all datasets, label V has the lowest accuracy score. It means that effectively fusing multi-modal information can strengthen the model's ability to understand video, and achieving the impressive performance on multi-modal tasks.

Conclusion
In this work, we presented the MAF-HMS model for Video-QA tasks. We use a modality attention fusion framework that combines visual and subtitles representation features to capture the semantics more accurately. We process video and subtitle features with two independent BERT layers, and use BERT to combine visual conceptual features and subtitles and questions with each candidate answer for embedding. These input features use modal attention fusion and a hybrid multi-head self-attention mechanism to obtain the final answer prediction.
Experiments were conducted to test the performance of our model. Results show that our model gave correct predictions from the language and visual representations on TVQA dataset. In our experiments, the MAF-HMS was evaluated on multiple Video QA datasets, especially, has achieved a test accuracy rate of 72.21% on TVQA dataset. Although the MSVD-QA and MSRVTT-QA datasets are much more challenging and open-ended nature, we are still able to achieve 36.82% and 35.26% accuracy which are 2.67 points and 0.22 points improvement on MSVD-QA and MSRVTT-QA. In addition, we perform an ablation study and analysis to clarify the strengths of our model, demonstrate the effectiveness of our model in Video-QA. Finally, experimental results demonstrate the effectiveness of our method in question type and required modality tasks.